The main aim is to predict breast cancer patients' chances of survival.
- Clean the data
- Augment the data
- Create some plots
- Run statistical analysis
- Create the prediction tool
We are working with a breast cancer dataset obtained from Kaggle.
This is the dataset we are working with:

    ##  patient_id   education   id_healthcenter   id_treatment_region
    ##  111035895969: 1   Diploma :253   1110000154: 14   1110000329:284
    ##  111035896483: 1   Elementary :112   1110000280: 11   1110000330:260
    ##  111035897677: 1   Middle School: 97   1110000303: 11   1110000331:189
    ##  111035897739: 1   Bachelor : 82   1110000181: 10
    ##  111035897959: 1   Illiterate : 79   1110000305: 10
    ##  111035898167: 1   High School : 55   1110000224: 9
    ##  (Other) :727   (Other) : 55   (Other) :668
    ##  hereditary_history   birth_date   age   weight
    ##  0:309   Min. :1944   Min. :20.00   Min. : 35.00
    ##  1:424   1st Qu.:1978   1st Qu.:29.00   1st Qu.: 73.00
    ##  Median :1985   Median :34.00   Median : 79.00
    ##  Mean :1982   Mean :36.83   Mean : 78.74
    ##  3rd Qu.:1990   3rd Qu.:41.00   3rd Qu.: 87.00
    ##  Max. :1999   Max. :75.00   Max. :101.00
    ##  NA's :2
    ##  thickness_tumor   marital_status   marital_length   pregnency_experience
    ##  Min. :0.0100   0:138   above 10 years:409   0:143
    ##  1st Qu.:0.4000   1:595   under 10 years:324   1:590
    ##  Median :0.6000
    ##  Mean :0.5753
    ##  3rd Qu.:0.8000
    ##  Max. :1.3000
    ##
    ##  giving_birth   age_FirstGivingBirth   abortion   blood   taking_heartMedicine
    ##  1 :364   above 30:428   0:593   A+ :176   0:280
    ##  0 :136   under 30:305   1:140   A- :124   1:453
    ##  2 :128   AB+ :119
    ##  3 : 75   B+ :108
    ##  4 : 13   O+ : 77
    ##  5 : 12   (Other):128
    ##  (Other): 5   NA's : 1
    ##  taking_blood_pressure_medicine   taking_gallbladder_disease_medicine   smoking
    ##  0:210   0:343   0:502
    ##  1:523   1:390   1:231
    ##
    ##  alcohol   breast_pain   radiation_history   Birth_control   menstrual_age
    ##  0:454   0:280   0:365   0:261   above 12:304
    ##  1:279   1:453   1:368   1:472   not yet : 2
    ##  under 12:427
    ##
    ##  menopausal_age   Benign_malignant_cancer   condition   treatment_age
    ##  above 50: 36   Benign :303   death :350   Min. :20.00
    ##  not yet :643   Malignant:430   recovered :132   1st Qu.:29.50
    ##  under 50: 52   under treatment:251   Median :34.00
    ##  NA's : 2   Mean :36.85
    ##  3rd Qu.:41.00
    ##  Max. :75.00
    ##  NA's :2

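
For readers who want to reproduce the numeric part of a summary like the one above outside R, the following is a minimal Python sketch using only the standard library. The function name `summarize_numeric` and the sample ages are illustrative assumptions, not part of the original analysis; quartile conventions differ between tools, so the values need not match R's output exactly.

```python
import statistics


def summarize_numeric(values):
    """Return min, quartiles, mean, max and the missing-value count for one column.

    Missing values are represented by None (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    # Quartiles computed with linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "Min.": min(present),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(present),
        "3rd Qu.": q3,
        "Max.": max(present),
        "NA's": len(values) - len(present),
    }


# Made-up ages for illustration only; these are not rows from the dataset.
ages = [20, 29, 34, None, 41, 75, 34, None]
summary = summarize_numeric(ages)
for label, value in summary.items():
    print(f"{label:8}: {value}")
```

Counting missing values separately, as the `NA's` rows do in the summary, keeps them from silently distorting the mean and quartiles.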
| BEFORE | AFTER |
|---|---|
| THE COLUMNS ARE DIFFERENT TYPES | EACH COLUMN HAS A CORRECT TYPE |
| 0, 1, 2 VALUES | BOOLEAN VARIABLES |
| NAMES WITH `\r\n` | CLEAN NAMES |
| BIRTH DATE WITH 3 CHARACTERS | BIRTH DATE WITH 4 CHARACTERS |
| BLOOD TYPE 44 | CORRECT BLOOD TYPES ONLY |
| IMPLAUSIBLE WEIGHT/AGE COMBINATIONS | PATIENTS UNDER 20 YEARS OLD AND UNDER 35 KG REMOVED |
| WOMEN AND MEN | ONLY WOMEN |
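
Some of the cleaning rules in the table above could be sketched as follows. This is a hedged illustration in plain Python, not the code used for the project: the helper names (`clean_name`, `to_bool`, `clean_rows`) and the sample rows are hypothetical. It also assumes that an invalid blood type is turned into a missing value rather than dropping the row, and that the age and weight rule means keeping rows with age of at least 20 and weight of at least 35 kg, which is consistent with the minima in the summary but not confirmed by the source.

```python
VALID_BLOOD_TYPES = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}


def clean_name(name):
    """Strip stray carriage returns, newlines and spaces from a column name."""
    return name.replace("\r", "").replace("\n", "").strip()


def to_bool(value):
    """Map "0"/"1" to False/True; anything else becomes None (missing)."""
    return {"0": False, "1": True}.get(str(value).strip())


def clean_rows(rows):
    cleaned = []
    for raw in rows:
        row = {clean_name(key): value for key, value in raw.items()}
        row["age"] = float(row["age"])
        row["weight"] = float(row["weight"])
        # Assumed reading of the rule: keep age >= 20 and weight >= 35 kg.
        if row["age"] < 20 or row["weight"] < 35:
            continue
        # Assumption: an invalid blood type (e.g. "44") becomes missing.
        if row["blood"] not in VALID_BLOOD_TYPES:
            row["blood"] = None
        row["smoking"] = to_bool(row["smoking"])
        cleaned.append(row)
    return cleaned


# Made-up rows for illustration; the "\r\n" in the header is hypothetical.
raw_rows = [
    {"age\r\n": "34", "weight": "79", "blood": "A+", "smoking": "1"},
    {"age\r\n": "17", "weight": "30", "blood": "O+", "smoking": "0"},
    {"age\r\n": "41", "weight": "87", "blood": "44", "smoking": "0"},
]
cleaned = clean_rows(raw_rows)
print(cleaned)
```

Keeping each rule in its own small step makes it easier to check the BEFORE and AFTER states one rule at a time.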
We created plots to explore the data and ran statistical analyses, including multiple correspondence analysis (MCA). The plots are shown in the next section, “Results”.
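
MCA is commonly described as correspondence analysis applied to an indicator (one-hot) matrix of the categorical variables. As a rough sketch of that first step only, the snippet below builds such a matrix in plain Python. The function name `indicator_matrix` and the three sample rows are illustrative assumptions; the decomposition itself is not shown and would normally be delegated to a statistics package.

```python
def indicator_matrix(rows, columns):
    """One-hot encode categorical columns into a 0/1 indicator matrix.

    Returns the list of (column, level) pairs and one 0/1 row per record.
    """
    levels = [
        (col, level)
        for col in columns
        for level in sorted({row[col] for row in rows})
    ]
    matrix = [
        [1 if row[col] == level else 0 for col, level in levels]
        for row in rows
    ]
    return levels, matrix


# Made-up records for illustration; column names follow the dataset summary.
rows = [
    {"smoking": "0", "alcohol": "1"},
    {"smoking": "1", "alcohol": "1"},
    {"smoking": "0", "alcohol": "0"},
]
levels, Z = indicator_matrix(rows, ["smoking", "alcohol"])
for record in Z:
    print(record)
```

Each row of the matrix sums to the number of encoded variables, which is a quick sanity check before running the analysis.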
- Health-related variables (medication, harmful habits) are highly prevalent among breast cancer patients.
- Menstrual periods starting before age 12 and menopause starting after age 55 expose women to hormones for longer, raising their risk of breast cancer.
- In most cases, deaths are higher among patients taking medication, which is counterintuitive.
- Not drinking alcohol or smoking improves recovery.
- Among patients who drink alcohol and smoke, deaths are lower, which does not make sense either.
- These are absolute values; relative values should probably be calculated as well.
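
The caveat about absolute versus relative values can be illustrated with a small sketch. The function name `death_rate_by_group` and all counts below are made up for illustration and are not results from the dataset; only the column names `smoking` and `condition` and the label `death` come from the summary.

```python
from collections import Counter


def death_rate_by_group(records, factor):
    """Return the share of records with condition == "death" per factor level."""
    totals = Counter()
    deaths = Counter()
    for record in records:
        totals[record[factor]] += 1
        if record["condition"] == "death":
            deaths[record[factor]] += 1
    return {group: deaths[group] / totals[group] for group in totals}


# Made-up counts: more deaths among non-smokers in absolute terms (30 vs 20),
# but a lower death rate, because the non-smoking group is much larger.
records = (
    [{"smoking": 0, "condition": "death"}] * 30
    + [{"smoking": 0, "condition": "recovered"}] * 70
    + [{"smoking": 1, "condition": "death"}] * 20
    + [{"smoking": 1, "condition": "recovered"}] * 20
)
rates = death_rate_by_group(records, "smoking")
print(rates)
```

When group sizes differ, as they do for most of the 0/1 variables in the summary, comparing raw counts can reverse the apparent direction of an effect.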
We have reached the following conclusions: